AITopics | better reward

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Neural Information Processing SystemsDec-26-2025, 15:16:28 GMT

Reinforcement learning from human feedback (RLHF)---where human preference judgments on LM outputs are transformed into a learning signal---has recently shown promise in addressing these issues. However, such holistic feedback conveys limited information on long text outputs; it does not indicate which aspects of the outputs influenced user preference; e.g., which parts contain what type(s) of errors. In this paper, we use fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal. We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects: (1) density, providing a reward after every segment (e.g., a sentence) is generated; and (2) incorporating multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness). We conduct experiments on detoxification and long-form question answering to illustrate how learning with this reward function leads to improved performance, supported by both automatic and human evaluation. Additionally, we show that LM behaviors can be customized using different combinations of fine-grained reward models. We release all data, collected human feedback, and codes at https://FineGrainedRLHF.github.io.

better reward, language model training, name change, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.82)

Add feedback

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Neural Information Processing SystemsJan-19-2025, 20:20:32 GMT

Reinforcement learning from human feedback (RLHF)---where human preference judgments on LM outputs are transformed into a learning signal---has recently shown promise in addressing these issues. However, such holistic feedback conveys limited information on long text outputs; it does not indicate which aspects of the outputs influenced user preference; e.g., which parts contain what type(s) of errors. In this paper, we use fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal. We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects: (1) density, providing a reward after every segment (e.g., a sentence) is generated; and (2) incorporating multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness). We conduct experiments on detoxification and long-form question answering to illustrate how learning with this reward function leads to improved performance, supported by both automatic and human evaluation.

better reward, human feedback, language model training, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.85)

Add feedback

Reinforcement Learning 6: Exploration vs Exploitation

#artificialintelligenceAug-23-2021, 03:30:16 GMT

Before getting into Q-Learning, lets understand the importance concept of Exploration vs Exploitation. Exploration is trying new possibilities in order to find better rewards. Exploitation is to keep opting for same actions that has given some significant rewards in the past. To illustrate this we can consider an example of restaurants. Suppose i recently start living in a new city.

exploitation, exploration and exploitation, restaurant, (7 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)

Add feedback

Markov Decision Processes with Ordinal Rewards: Reference Point-Based Preferences

Weng, Paul (LIP6, UPMC)

AAAI ConferencesMay-18-2011

In a standard Markov decision process (MDP), rewards are assumed to be precisely known and of quantitative nature. This can be a too strong hypothesis in some situations. When rewards can really be modeled numerically, specifying the reward function is often difficult as it is a cognitively-demanding and/or time-consuming task. Besides, rewards can sometimes be of qualitative nature as when they represent qualitative risk levels for instance. In those cases, it is problematic to use directly standard MDPs and we propose instead to resort to MDPs with ordinal rewards. Only a total order over rewards is assumed to be known. In this setting, we explain how an alternative way to define expressive and interpretable preferences using reference points can be exploited.

history, preference relation, reference point, (15 more...)

AAAI Conferences

Twenty-First International Conference on Automated Planning and Scheduling

Country: Europe > France > Île-de-France > Paris > Paris (0.04)

Industry: Leisure & Entertainment > Games > Computer Games (0.46)

Technology: